Featuring Movie Popularity Vs. Revenue
The tmdb-movies dataset is composed of movie statistics. It includes data on movie budgets and revenues, as well as information on cast members, directors and technical details such as runtime. My analysis will focus on whether higher popularity means higher revenue. The dependent variables I will use are budget, popularity and revenue. The independent variables are release year, release date and runtime. I also want to see whether longer runtimes go with lower popularity and/or lower revenue.
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('DataSets/tmdb-movies.csv')
df.head(3)
I run df.head() to view the dataset. Some columns, such as homepage and tagline, may not be useful in my investigation, but they will not interfere with my analysis, so I do not feel they need to be dropped. I prefer to keep them because questions come up during exploration and I may need them.
# How many rows & columns
df.shape
# Explore data types to ensure columns have the correct types, e.g. strings are objects and numbers are ints or floats
df.dtypes
# Explore further and notice the release_date is an object
df.info()
# Convert release_date from object to date
df['release_date'] = pd.to_datetime(df['release_date'])
df.info()
# Verify change to release_date
df.head(3)
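One thing worth checking after this conversion: if release_date is stored with two-digit years (for example 6/9/15, which is an assumption about the raw file), the %y directive maps 00 to 68 onto 2000 to 2068, so a 1966 release would come back as 2066. Below is a minimal sketch on made-up rows, not the real data, that uses release_year to detect and correct the century.

```python
import pandas as pd

# Toy rows standing in for the real data (assumed m/d/yy date format).
toy = pd.DataFrame({
    "release_date": ["6/9/15", "12/20/66", "3/24/72"],
    "release_year": [2015, 1966, 1972],
})

# Parse with an explicit two-digit-year format.
parsed = pd.to_datetime(toy["release_date"], format="%m/%d/%y")

# Two-digit years 00-68 are read as 20xx, so 1966 is parsed as 2066.
# Flag rows where the parsed year is later than the recorded release_year.
wrong_century = parsed.dt.year > toy["release_year"]

# Keep correct dates as they are and shift the flagged ones back 100 years.
fixed = parsed.where(~wrong_century, parsed - pd.DateOffset(years=100))
print(fixed)
```

On the real DataFrame the same idea would be applied to df['release_date'] and df['release_year'], after first confirming that the mismatch actually occurs.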
# check unique data in each column, look for missing data
df.nunique()
# Check for null data (the dataset does contain nulls)
# Identify null values by row
null_data = df[df.isnull().any(axis=1)]
null_data
# Identify null values by column
# The missing data is not consequential because it is all in object (text) columns
# homepage, for example, may never have existed or may have been taken down; it is not relevant to the questions we need to answer
# Therefore I have determined there is no need to fill in null data in this dataset
null_columns = df.columns[df.isnull().any()]
df[null_columns].isnull().sum()
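The claim that nulls only appear in text columns can be checked directly rather than read off the output. The sketch below uses a small made-up frame (the column values are illustrative, not taken from the dataset) to list each column that has nulls next to its dtype, and to confirm that no numeric column is affected.

```python
import pandas as pd

# Toy frame: two text columns with gaps and one complete numeric column.
toy = pd.DataFrame({
    "homepage": ["http://example.com", None, None],
    "tagline": ["A tagline", "Another", None],
    "revenue": [100, 0, 250],
})

# Pair each column's dtype with its null count, keeping only columns with nulls.
null_summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "null_count": toy.isnull().sum(),
})
null_summary = null_summary[null_summary["null_count"] > 0]
print(null_summary)

# True if any numeric column contains a null; False supports the claim above.
numeric_with_nulls = toy.select_dtypes(include="number").isnull().any().any()
print(numeric_with_nulls)
```

If numeric_with_nulls came back True on the real data, the decision not to fill or drop nulls would need a second look.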
df.describe()
df.hist(figsize=(8, 8));
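# For 2015, which movie was most popular and what revenue did it generate?
# release_year is an integer column, so compare against 2015 rather than the string "2015"
ryear2015 = df.query('release_year == 2015')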
df2015 = ryear2015[ryear2015['popularity'] == ryear2015['popularity'].max()]
print(df2015.loc[:,['original_title', 'popularity', 'revenue']])
“Jurassic World” was the most popular movie of 2015, but according to the query below it did not make the most revenue.
ryear2015top5 = ryear2015.nlargest(5, ['popularity'])
ryear2015top5.loc[:,['original_title', 'popularity', 'revenue']]
fig = px.bar(ryear2015top5, x='original_title', y='revenue', color='popularity', title='Top 5 Popular Movies of 2015',
hover_name="original_title")
fig.show()
Of the top 5 in popularity for 2015, “Star Wars: The Force Awakens” had the highest revenue.
Also, note the bar chart is arranged by popularity, so we can quickly see that the most popular titles did not have the highest revenues.
# Does runtime affect popularity?
# I also included the release_year as color, which is interesting
fig = px.scatter(df, x='popularity', y='runtime', color='release_year', title='Runtime Vs. Popularity', hover_name="original_title")
fig.show()
The scatter plot above shows the relationship between runtime and popularity. Here we see that the longest movies tend to have low popularity. Likewise, the more popular movies have shorter runtimes.
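A scatter plot suggests a pattern, but a correlation coefficient puts a number on it. The sketch below is one way to do that, shown on made-up values rather than the dataset: it computes the Pearson coefficient and a rank-based (Spearman) coefficient, the latter obtained by correlating the ranks so that a few extreme runtimes have less influence.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "runtime": [90, 100, 120, 150, 200],
    "popularity": [0.5, 2.0, 3.5, 1.0, 0.2],
})

# Pearson: strength of the linear relationship, between -1 and 1.
pearson = toy["runtime"].corr(toy["popularity"])

# Spearman: Pearson correlation of the ranks, less sensitive to outliers.
spearman = toy["runtime"].rank().corr(toy["popularity"].rank())

print(round(pearson, 3), round(spearman, 3))
```

On the real data the same two lines would use df['runtime'] and df['popularity']. A value near zero would mean the visual impression comes mostly from a handful of extreme points.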
#df.plot(x='budget', y='revenue', kind='scatter');
# Let's see popularity plotted against revenue
df.plot(kind='scatter', x='popularity', y='revenue', color='purple', title='Popularity Vs. Revenue', figsize=(14, 8));
From the scatter plot above we see that the highest-revenue movie is not the most popular one. There are examples of very popular movies that have generated large revenues, but the majority of movies cluster around a low popularity rating.
# I first sought to find the mean of popularity
mean_popularity = df['popularity'].mean()
mean_popularity
# Second, I separated popularity into two categories: least popular and most popular, using the mean as the central point.
leastpopular = df['popularity'] <= mean_popularity
mostpopular = df['popularity'] > mean_popularity
df.revenue[mostpopular].hist(alpha=0.5, bins=2, label='Most Popular')
df.revenue[leastpopular].hist(alpha=0.5, bins=2, label='Least Popular')
plt.xlabel('Revenue')
plt.ylabel('Counts')
plt.title('Most Popular Movies compared to Least Popular Movies')
plt.legend();
The above histogram is composed of 2 bins. The y axis shows the sample counts, and the values binned along the x axis are revenues, with one series for each popularity group. I split the samples at the mean popularity. The least popular group had more samples but collectively generated less revenue. The most popular group generated more revenue but had fewer samples.
df.popularity[mostpopular]
df.popularity[leastpopular]
df.revenue[mostpopular].sum()
df.revenue[leastpopular].sum()
The four cells above show further queries. The first two print the popularity values for each group; note the length of each, which is what the histogram displays as sample counts. I also wanted to see the revenue amounts. The most popular group totals 372,935,640,778 dollars (about 372.9 billion) compared to 59,784,552,097 dollars (about 59.8 billion) for the least popular group.
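Because the two groups differ in size, totals alone can be hard to compare. As a sketch of one alternative, the code below splits a made-up frame at the mean popularity and reports count, sum, mean and median revenue per group in a single table. The group labels and toy values are assumptions for illustration.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "popularity": [0.1, 0.2, 0.3, 1.4, 3.0],
    "revenue": [0, 10, 20, 300, 900],
})

# Split at the mean popularity, as in the cells above.
mean_popularity = toy["popularity"].mean()
group = (
    (toy["popularity"] > mean_popularity)
    .map({True: "most popular", False: "least popular"})
    .rename("group")
)

# One row per group: number of movies, total, average and median revenue.
summary = toy.groupby(group)["revenue"].agg(["count", "sum", "mean", "median"])
print(summary)
```

The mean and median columns show revenue per movie, which says whether the most popular group earns more per title or only more in aggregate.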
First, I investigated different groups to see whether popular movies generated larger revenues than movies that are low in popularity. I started with an isolated case from 2015, pulling out the data on the 5 most popular movies that year. The chart showed that the most popular movies had lower revenues than some titles that were not as popular.
Next, I constructed scatter plots to compare runtime with popularity, and then popularity with revenue.
Finally, I computed the mean of popularity and split the data into two samples: least popular and most popular. The least popular movies had more samples, but the most popular collectively made greater revenue.
To conclude, you can find instances where a highly popular movie grosses less than less popular movies. However, across the larger sample groups, the most popular movies generate larger revenues.